In this post, I will show how to build a simple Machine Learning classifier that can help investors choose which stocks to invest in.
This article, although introductory, is targeted at people with some Python and ML knowledge. If you are interested in a more high-level explanation of ML and Trading, check out this post.
The following topics will be covered:
The data and code used in this article are available on my GitHub page. Feel free to clone the project to follow along.
In the repository, you will also find a README.md explaining how to set up an environment with all the necessary dependencies.
# load lab_black for easy code formatting
%load_ext lab_black
To make it easier to follow along, I have created a sample dataset containing historical data for the 30 Dow Jones constituents. This dataset has already been aggregated at month level and some financial ratios have also been created.
The raw data comes from Tiingo, a company that offers financial data APIs.
Let's load the dataset into Pandas.
Here is the meaning of the columns:
import pandas as pd
pd.options.mode.chained_assignment = None
# read dataset
df = pd.read_csv("data/dataset.csv", parse_dates=["date"])
# format index
df = df.set_index(["ticker", "date"])
# display data
df
| adjOpen | adjClose | price_rate_of_change_1M | price_rate_of_change_3M | epsDil | return_on_assets | return_on_equity | price_to_earnings_ratio | debt_to_equity_ratio | ||
|---|---|---|---|---|---|---|---|---|---|---|
| ticker | date | |||||||||
| AAPL | 2000-01-31 | 0.801664 | 0.793102 | 0.009143 | 0.294933 | 0.006 | 0.021507 | 0.035760 | 132.183691 | 0.662693 |
| 2000-02-29 | 0.795013 | 0.876196 | 0.104771 | 0.171145 | 0.009 | 0.024123 | 0.041459 | 97.355147 | 0.718623 | |
| 2000-03-31 | 0.906315 | 1.038180 | 0.184872 | 0.320980 | 0.009 | 0.024123 | 0.041459 | 115.353363 | 0.718623 | |
| 2000-04-30 | 1.035811 | 0.948359 | -0.086518 | 0.195759 | 0.009 | 0.024123 | 0.041459 | 105.373229 | 0.718623 | |
| 2000-05-31 | 0.954551 | 0.642126 | -0.322908 | -0.267144 | 0.011 | 0.033252 | 0.055279 | 58.375098 | 0.662396 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WMT | 2020-08-31 | 126.402983 | 135.654958 | 0.077424 | 0.123800 | 1.400 | 0.017132 | 0.058470 | 96.896398 | 2.326817 |
| 2020-09-30 | 137.950883 | 136.690566 | 0.007634 | 0.172842 | 2.270 | 0.027281 | 0.085991 | 60.216109 | 2.073895 | |
| 2020-10-31 | 137.560087 | 135.557259 | -0.008291 | 0.076648 | 2.270 | 0.027281 | 0.085991 | 59.716854 | 2.073895 | |
| 2020-11-30 | 137.354919 | 149.274188 | 0.101189 | 0.100396 | 2.270 | 0.027281 | 0.085991 | 65.759554 | 2.073895 | |
| 2020-12-31 | 150.065549 | 141.350206 | -0.053083 | 0.034089 | 1.800 | 0.020469 | 0.063060 | 78.527892 | 2.006103 |
7160 rows × 9 columns
We are going to build a classifier model whose target is a boolean variable indicating whether the stock has grown by more than X% within the month.
Before doing that, we take the following assumptions:
The target will then take the following values:
- True if the stock's return is higher than or equal to X%
- False otherwise

The choice of the target threshold X is something that can be determined using experiments or simulations on past data. Here we will keep it nice and simple and use a fixed threshold of 5%.
Feel free to experiment with other thresholds (10%, 20%…) or even build a more complicated target using take-profits and stop-losses, like in the Triple Barrier method suggested by Marcos López de Prado (Advances in Financial Machine Learning).
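As a rough illustration of that idea, here is a much-simplified sketch of a barrier-style label. This is not de Prado's full method (which, among other things, sizes the barriers using volatility estimates); the price path and the `triple_barrier_label` helper are made up for illustration.

```python
import pandas as pd


def triple_barrier_label(prices: pd.Series, tp: float, sl: float, horizon: int) -> int:
    """Label a price path: 1 if the take-profit barrier is hit first,
    -1 if the stop-loss barrier is hit first, 0 if neither is hit
    within `horizon` periods (the "vertical" barrier)."""
    entry = prices.iloc[0]
    # returns relative to the entry price, over the next `horizon` periods
    path = prices.iloc[1 : horizon + 1] / entry - 1
    for r in path:
        if r >= tp:
            return 1
        if r <= -sl:
            return -1
    return 0


# illustrative price path only
prices = pd.Series([100.0, 102.0, 98.0, 107.0, 95.0])
print(triple_barrier_label(prices, tp=0.05, sl=0.05, horizon=4))  # 1: +7% is hit before -5%
```

The returned class (1 / -1 / 0) could then replace the simple boolean target used below.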
Here is the code for creating the target.
# if the price increases by more than x%, we label it as "True" or "Buy"
threshold = 0.05 # 5%
# calculate the return within the month
df["return_month"] = (df["adjClose"] / df["adjOpen"]) - 1
# create the target
df["target"] = df["return_month"] >= threshold
# display data
df[["adjOpen", "adjClose", "return_month", "target"]]
| adjOpen | adjClose | return_month | target | ||
|---|---|---|---|---|---|
| ticker | date | ||||
| AAPL | 2000-01-31 | 0.801664 | 0.793102 | -0.010680 | False |
| 2000-02-29 | 0.795013 | 0.876196 | 0.102115 | True | |
| 2000-03-31 | 0.906315 | 1.038180 | 0.145496 | True | |
| 2000-04-30 | 1.035811 | 0.948359 | -0.084428 | False | |
| 2000-05-31 | 0.954551 | 0.642126 | -0.327300 | False | |
| ... | ... | ... | ... | ... | ... |
| WMT | 2020-08-31 | 126.402983 | 135.654958 | 0.073194 | True |
| 2020-09-30 | 137.950883 | 136.690566 | -0.009136 | False | |
| 2020-10-31 | 137.560087 | 135.557259 | -0.014560 | False | |
| 2020-11-30 | 137.354919 | 149.274188 | 0.086777 | True | |
| 2020-12-31 | 150.065549 | 141.350206 | -0.058077 | False |
7160 rows × 4 columns
To simplify the notebook, I have already pre-computed some features using the raw data. If you are interested in knowing how I created them, let me know in the comments.
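I won't reproduce the full feature pipeline here, but a price rate-of-change feature like price_rate_of_change_1M can be sketched with Pandas' pct_change. The prices below are made up for illustration:

```python
import pandas as pd

# made-up monthly closing prices for a single ticker
prices = pd.Series([100.0, 105.0, 103.0, 110.0])

# k-period rate of change: (p_t / p_{t-k}) - 1
roc_1m = prices.pct_change(1)
roc_3m = prices.pct_change(3)

print(roc_1m.iloc[-1])  # (110 / 103) - 1
print(roc_3m.iloc[-1])  # (110 / 100) - 1
```

On the full multi-ticker frame, something like df.groupby("ticker")["adjClose"].pct_change(1) would keep the computation from leaking across tickers.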
We will use the following features for building the model:
A very important step here is to shift the value of the features by one period.
Why do we do this?
Because the actual values of those features are only known at the end of the month. We need to make sure that the input data (the features) we use for predicting the target is available at the beginning of the month. By shifting the features' values by one period, we make sure that no data leakage is created.
# list of features
features = [
"price_rate_of_change_1M",
"price_rate_of_change_3M",
"epsDil",
"return_on_assets",
"return_on_equity",
"price_to_earnings_ratio",
"debt_to_equity_ratio",
]
# shift the value of the features by one period (make sure to use groupby!)
df[features] = df.groupby("ticker")[features].shift(1)
We then need to drop the first row for each ticker, to get rid of the NaN values created by the shift() operation.
# remove the first row for each ticker to get rid of the NaN created after doing the shift
df = df.loc[df.groupby("ticker").cumcount() > 0]
# display data
df[features + ["target"]]
| price_rate_of_change_1M | price_rate_of_change_3M | epsDil | return_on_assets | return_on_equity | price_to_earnings_ratio | debt_to_equity_ratio | target | ||
|---|---|---|---|---|---|---|---|---|---|
| ticker | date | ||||||||
| AAPL | 2000-02-29 | 0.009143 | 0.294933 | 0.006 | 0.021507 | 0.035760 | 132.183691 | 0.662693 | True |
| 2000-03-31 | 0.104771 | 0.171145 | 0.009 | 0.024123 | 0.041459 | 97.355147 | 0.718623 | True | |
| 2000-04-30 | 0.184872 | 0.320980 | 0.009 | 0.024123 | 0.041459 | 115.353363 | 0.718623 | False | |
| 2000-05-31 | -0.086518 | 0.195759 | 0.009 | 0.024123 | 0.041459 | 105.373229 | 0.718623 | False | |
| 2000-06-30 | -0.322908 | -0.267144 | 0.011 | 0.033252 | 0.055279 | 58.375098 | 0.662396 | True | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WMT | 2020-08-31 | 0.080314 | 0.069299 | 1.400 | 0.017132 | 0.058470 | 89.933393 | 2.326817 | True |
| 2020-09-30 | 0.077424 | 0.123800 | 1.400 | 0.017132 | 0.058470 | 96.896398 | 2.326817 | False | |
| 2020-10-31 | 0.007634 | 0.172842 | 2.270 | 0.027281 | 0.085991 | 60.216109 | 2.073895 | False | |
| 2020-11-30 | -0.008291 | 0.076648 | 2.270 | 0.027281 | 0.085991 | 59.716854 | 2.073895 | True | |
| 2020-12-31 | 0.101189 | 0.100396 | 2.270 | 0.027281 | 0.085991 | 65.759554 | 2.073895 | False |
7130 rows × 8 columns
To do a quick data exploration, we will use the Pandas Profiling library. This library is super nice for building profiling reports in just one line of code, as shown below.
We get this nice interactive report, feel free to scroll through it in the notebook.
from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report", minimal=True)
profile.to_notebook_iframe()
From the report, we can see that the data is quite clean overall.
It is important to notice that our target distribution is unbalanced, something to keep in mind during the modeling part!
We also see some outliers in the price_to_earnings_ratio and debt_to_equity_ratio columns. I am not going to dig deeper into this now; however, this is definitely something worth investigating at a later stage. Removing or fixing those outliers might improve the performance of the model. Let's remember that data quality is very important for Machine Learning!
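One simple way to limit the impact of such outliers, sketched here on made-up numbers, is to winsorize the column, i.e. clip it at chosen percentiles (the 1st/99th percentiles below are an arbitrary choice):

```python
import pandas as pd

# made-up ratio column with one extreme value
s = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0])

# clip everything outside the 1st-99th percentile range
lower, upper = s.quantile(0.01), s.quantile(0.99)
s_winsorized = s.clip(lower, upper)

print(s_winsorized.max() < s.max())  # True: the extreme value was pulled in
```

Whether clipping, dropping, or keeping the outliers works best here would have to be checked experimentally.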
Now that we have a dataset to work with, we are going to build a simple classifier model using LightGBM.
LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
I like LightGBM because of its high accuracy, fast training, and most importantly its ease of use. Indeed, you barely need any pre-processing to make a LightGBM model work: categorical features and NaN values, for example, are handled automatically.
As it is tree-based, it is also better for the environment (😁) and easier to explain and visualize for non-technical people, leading to more acceptance of the model by the stakeholders.
For the sake of simplicity, we will use a simple train/test split and will leave out any more complicated cross-validation procedures.
Data for the year 2020 will be used for testing, and data before that will be used for training.
split_date = 2020
df_train = df.loc[df.index.get_level_values("date").year < split_date]
df_test = df.loc[df.index.get_level_values("date").year == split_date]
# show train data
print(
df_train.index.get_level_values("date").min(),
df_train.index.get_level_values("date").max(),
)
df_train
2000-02-29 00:00:00 2019-12-31 00:00:00
| adjOpen | adjClose | price_rate_of_change_1M | price_rate_of_change_3M | epsDil | return_on_assets | return_on_equity | price_to_earnings_ratio | debt_to_equity_ratio | return_month | target | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ticker | date | |||||||||||
| AAPL | 2000-02-29 | 0.795013 | 0.876196 | 0.009143 | 0.294933 | 0.006 | 0.021507 | 0.035760 | 132.183691 | 0.662693 | 0.102115 | True |
| 2000-03-31 | 0.906315 | 1.038180 | 0.104771 | 0.171145 | 0.009 | 0.024123 | 0.041459 | 97.355147 | 0.718623 | 0.145496 | True | |
| 2000-04-30 | 1.035811 | 0.948359 | 0.184872 | 0.320980 | 0.009 | 0.024123 | 0.041459 | 115.353363 | 0.718623 | -0.084428 | False | |
| 2000-05-31 | 0.954551 | 0.642126 | -0.086518 | 0.195759 | 0.009 | 0.024123 | 0.041459 | 105.373229 | 0.718623 | -0.327300 | False | |
| 2000-06-30 | 0.624926 | 0.800823 | -0.322908 | -0.267144 | 0.011 | 0.033252 | 0.055279 | 58.375098 | 0.662396 | 0.281468 | True | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WMT | 2019-08-31 | 105.399599 | 109.697015 | -0.000996 | 0.079033 | 1.330 | 0.016381 | 0.056330 | 79.290920 | 2.340503 | 0.040773 | False |
| 2019-09-30 | 109.140178 | 113.940502 | 0.040207 | 0.131881 | 1.330 | 0.016381 | 0.056330 | 82.478959 | 2.340503 | 0.043983 | False | |
| 2019-10-31 | 114.103713 | 112.577210 | 0.038684 | 0.079370 | 1.260 | 0.015371 | 0.051332 | 90.428970 | 2.242809 | -0.013378 | False | |
| 2019-11-30 | 113.210853 | 114.334129 | -0.011965 | 0.067518 | 1.260 | 0.015371 | 0.051332 | 89.346992 | 2.242809 | 0.009922 | False | |
| 2019-12-31 | 114.391733 | 114.603719 | 0.015606 | 0.042272 | 1.260 | 0.015371 | 0.051332 | 90.741372 | 2.242809 | 0.001853 | False |
6770 rows × 11 columns
# show test data
print(
df_test.index.get_level_values("date").min(),
df_test.index.get_level_values("date").max(),
)
df_test
2020-01-31 00:00:00 2020-12-31 00:00:00
| adjOpen | adjClose | price_rate_of_change_1M | price_rate_of_change_3M | epsDil | return_on_assets | return_on_equity | price_to_earnings_ratio | debt_to_equity_ratio | return_month | target | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ticker | date | |||||||||||
| AAPL | 2020-01-31 | 72.882172 | 76.146912 | 0.098784 | 0.315005 | 0.758 | 0.040429 | 0.151247 | 95.309987 | 2.741004 | 0.044795 | False |
| 2020-02-29 | 74.865126 | 67.414954 | 0.054010 | 0.247904 | 1.248 | 0.065281 | 0.248361 | 61.015154 | 2.804470 | -0.099515 | False | |
| 2020-03-31 | 69.614769 | 62.711987 | -0.114673 | 0.025324 | 1.248 | 0.065281 | 0.248361 | 54.018393 | 2.804470 | -0.099157 | False | |
| 2020-04-30 | 60.790848 | 72.455785 | -0.069761 | -0.131954 | 1.248 | 0.065281 | 0.248361 | 50.249989 | 2.804470 | 0.191886 | True | |
| 2020-05-31 | 70.593835 | 78.616414 | 0.155374 | -0.048474 | 1.248 | 0.065281 | 0.248361 | 58.057520 | 2.804470 | 0.113644 | True | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WMT | 2020-08-31 | 126.402983 | 135.654958 | 0.080314 | 0.069299 | 1.400 | 0.017132 | 0.058470 | 89.933393 | 2.326817 | 0.073194 | True |
| 2020-09-30 | 137.950883 | 136.690566 | 0.077424 | 0.123800 | 1.400 | 0.017132 | 0.058470 | 96.896398 | 2.326817 | -0.009136 | False | |
| 2020-10-31 | 137.560087 | 135.557259 | 0.007634 | 0.172842 | 2.270 | 0.027281 | 0.085991 | 60.216109 | 2.073895 | -0.014560 | False | |
| 2020-11-30 | 137.354919 | 149.274188 | -0.008291 | 0.076648 | 2.270 | 0.027281 | 0.085991 | 59.716854 | 2.073895 | 0.086777 | True | |
| 2020-12-31 | 150.065549 | 141.350206 | 0.101189 | 0.100396 | 2.270 | 0.027281 | 0.085991 | 65.759554 | 2.073895 | -0.058077 | False |
360 rows × 11 columns
Let's create a classifier estimator and fit it on the train data. I am using the default hyperparameters, except for is_unbalance, which is set to True (given the high class imbalance of the dataset), and max_depth, num_leaves, and min_child_samples, which are set to "appropriate" values according to the LightGBM documentation.
Feel free to experiment with other hyperparameters!
from lightgbm import LGBMClassifier
# define classifier
estimator = LGBMClassifier(
is_unbalance=True,
max_depth=4,
num_leaves=8,
min_child_samples=400,
n_estimators=50,
)
# fit classifier on training data
estimator.fit(df_train[features], df_train["target"])
LGBMClassifier(is_unbalance=True, max_depth=4, min_child_samples=400,
n_estimators=50, num_leaves=8)
Once the model has been fitted on the training data, we can use it to make predictions on the test data. A new column, buy, is created in df_test; it contains the predictions made by the model.
# make prediction using test data
df_test["buy"] = estimator.predict(df_test[features])
# display data
df_test[["return_month", "target", "buy"]]
| return_month | target | buy | ||
|---|---|---|---|---|
| ticker | date | |||
| AAPL | 2020-01-31 | 0.044795 | False | False |
| 2020-02-29 | -0.099515 | False | False | |
| 2020-03-31 | -0.099157 | False | False | |
| 2020-04-30 | 0.191886 | True | True | |
| 2020-05-31 | 0.113644 | True | False | |
| ... | ... | ... | ... | ... |
| WMT | 2020-08-31 | 0.073194 | True | False |
| 2020-09-30 | -0.009136 | False | False | |
| 2020-10-31 | -0.014560 | False | False | |
| 2020-11-30 | 0.086777 | True | False | |
| 2020-12-31 | -0.058077 | False | False |
360 rows × 3 columns
# display only the stocks with buy=True
df_test.loc[df_test["buy"] == True][["return_month", "target", "buy"]]
| return_month | target | buy | ||
|---|---|---|---|---|
| ticker | date | |||
| AAPL | 2020-04-30 | 0.191886 | True | True |
| AMGN | 2020-02-29 | -0.070369 | False | True |
| 2020-03-31 | 0.014157 | False | True | |
| 2020-06-30 | 0.030451 | False | True | |
| 2020-11-30 | 0.008922 | False | True | |
| ... | ... | ... | ... | ... |
| WBA | 2020-11-30 | 0.114243 | True | True |
| 2020-12-31 | 0.039021 | False | True | |
| WMT | 2020-02-29 | -0.062837 | False | True |
| 2020-03-31 | 0.060722 | True | True | |
| 2020-07-31 | 0.083298 | True | True |
140 rows × 3 columns
Now that we have predictions on the test set, we can move on to the evaluation part, where we are going to assess the performance of the model.
We are going to evaluate the performance of the model in two ways:
To get an overall idea of the performance of the classifier, we will use the classification_report from sklearn.
The overall accuracy is 61%. The model does a fairly good job at predicting the False class (73% precision, 66% recall) but is less good at predicting the True class (42% precision, 50% recall).
We should be careful when using accuracy in this case, given the high class imbalance of the dataset.
from sklearn.metrics import classification_report
print(classification_report(df_test["target"], df_test["buy"]))
precision recall f1-score support
False 0.73 0.66 0.69 241
True 0.42 0.50 0.46 119
accuracy 0.61 360
macro avg 0.57 0.58 0.57 360
weighted avg 0.63 0.61 0.62 360
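Given that imbalance, the confusion matrix and balanced accuracy (which averages recall over both classes) give a clearer picture than plain accuracy. The labels below are made up for illustration; in the notebook you would pass df_test["target"] and df_test["buy"] instead:

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# made-up labels for illustration
y_true = [False, False, False, True, True, False]
y_pred = [False, True, False, True, False, False]

# rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
```

Unlike plain accuracy, balanced accuracy is not inflated by always predicting the majority (False) class.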
Let's now focus more on the financial performance of the model. That is, would we have been able to make money with this model?
To do that, we take the following assumptions:
- each month, we invest in n different stocks (n depending on the model predictions)
- the capital is split equally between the selected stocks (each receives a weight of 1/n)

With those assumptions, we can easily compute the monthly return of the strategy and then calculate financial metrics like total return or Sharpe ratio.
We start by selecting only the stocks for which the model made a positive prediction (buy).
# select only the stocks that were picked by the model
df_buy = df_test.loc[df_test["buy"] == True][["return_month", "target", "buy"]]
df_buy
| return_month | target | buy | ||
|---|---|---|---|---|
| ticker | date | |||
| AAPL | 2020-04-30 | 0.191886 | True | True |
| AMGN | 2020-02-29 | -0.070369 | False | True |
| 2020-03-31 | 0.014157 | False | True | |
| 2020-06-30 | 0.030451 | False | True | |
| 2020-11-30 | 0.008922 | False | True | |
| ... | ... | ... | ... | ... |
| WBA | 2020-11-30 | 0.114243 | True | True |
| 2020-12-31 | 0.039021 | False | True | |
| WMT | 2020-02-29 | -0.062837 | False | True |
| 2020-03-31 | 0.060722 | True | True | |
| 2020-07-31 | 0.083298 | True | True |
140 rows × 3 columns
We then aggregate the data at month level to get an overview of how many stocks the model picked per month and what the average return was. We can use the mean return per month because we assumed that we would be investing 1/n in each selected stock.
df_results = (
df_buy.reset_index()
.groupby("date")
.agg({"ticker": "count", "return_month": "mean"})
)
df_results
| ticker | return_month | |
|---|---|---|
| date | ||
| 2020-01-31 | 3 | 0.048522 |
| 2020-02-29 | 9 | -0.085939 |
| 2020-03-31 | 24 | -0.122494 |
| 2020-04-30 | 27 | 0.148161 |
| 2020-05-31 | 12 | 0.064157 |
| 2020-06-30 | 5 | 0.084158 |
| 2020-07-31 | 9 | 0.000687 |
| 2020-08-31 | 9 | 0.095477 |
| 2020-09-30 | 8 | -0.038285 |
| 2020-10-31 | 15 | -0.044311 |
| 2020-11-30 | 12 | 0.152646 |
| 2020-12-31 | 7 | 0.052516 |
We can use the describe() function to get some statistics.
df_results.describe()
| ticker | return_month | |
|---|---|---|
| count | 12.000000 | 12.000000 |
| mean | 11.666667 | 0.029608 |
| std | 7.227892 | 0.088410 |
| min | 3.000000 | -0.122494 |
| 25% | 7.750000 | -0.039792 |
| 50% | 9.000000 | 0.050519 |
| 75% | 12.750000 | 0.086988 |
| max | 27.000000 | 0.152646 |
The number of stocks picked per month ranges from 3 to 27 and the average return per month is 2.96%.
Let's also compute the Sharpe ratio, a very common metric for assessing the return of an investment relative to its risk.
import numpy as np
def sharpe(s_return: pd.Series, annualize: int, rf: float = 0) -> float:
    """
    Calculate the Sharpe ratio
    :param s_return: pd.Series with returns
    :param annualize: int periods to use for annualization (252 daily, 12 monthly, 4 quarterly)
    :param rf: float risk-free rate
    :return: float Sharpe ratio
    """
    # (mean - rf) / std
    sharpe_ratio = (s_return.mean() - rf) / s_return.std()
    # annualize
    sharpe_ratio = sharpe_ratio * np.sqrt(annualize)
    return sharpe_ratio
sharpe_ratio = sharpe(df_results["return_month"], annualize=12)
print(f"Sharpe ratio: {round(sharpe_ratio, 2)}")
Sharpe ratio: 1.16
Sharpe ratio of 1.16, not bad :)
To visualize the return over time, we first need to calculate the cumulative return.
# by using the monthly return, we can calculate the cumulative return over the entire year
df_results["return_month_cumulative"] = (df_results["return_month"] + 1).cumprod() - 1
df_results
| ticker | return_month | return_month_cumulative | |
|---|---|---|---|
| date | |||
| 2020-01-31 | 3 | 0.048522 | 0.048522 |
| 2020-02-29 | 9 | -0.085939 | -0.041586 |
| 2020-03-31 | 24 | -0.122494 | -0.158986 |
| 2020-04-30 | 27 | 0.148161 | -0.034381 |
| 2020-05-31 | 12 | 0.064157 | 0.027570 |
| 2020-06-30 | 5 | 0.084158 | 0.114049 |
| 2020-07-31 | 9 | 0.000687 | 0.114814 |
| 2020-08-31 | 9 | 0.095477 | 0.221252 |
| 2020-09-30 | 8 | -0.038285 | 0.174496 |
| 2020-10-31 | 15 | -0.044311 | 0.122453 |
| 2020-11-30 | 12 | 0.152646 | 0.293791 |
| 2020-12-31 | 7 | 0.052516 | 0.361736 |
We can then make some nice plots using Plotly.
import plotly.express as px
# plot monthly return
fig = px.bar(df_results, y="return_month", title="Monthly return (%)")
fig.show()
# plot cumulative return
fig = px.line(df_results, y="return_month_cumulative", title="Cumulative return")
fig.show()
These first results don't look bad at all: even with the COVID-19 crisis around March, we end 2020 with a +36% return, which is very promising :)
However, to have a complete picture, we need to put these results in perspective and compare them with a benchmark strategy. As we used the 30 stocks of the Dow Jones, we will use an ETF that is tracking the same index: DIA.
Let's load the data of this ETF.
# load the historical price DIA (benchmark strategy)
df_benchmark = pd.read_csv("data/prices_DIA.csv")
df_benchmark
| date | adjOpen | adjClose | return_month | return_month_cumulative | |
|---|---|---|---|---|---|
| 0 | 2020-01-31 00:00:00+00:00 | 274.265829 | 270.544375 | -0.013569 | -0.013569 |
| 1 | 2020-02-29 00:00:00+00:00 | 271.760972 | 244.532466 | -0.100193 | -0.112402 |
| 2 | 2020-03-31 00:00:00+00:00 | 246.597774 | 211.254631 | -0.143323 | -0.239615 |
| 3 | 2020-04-30 00:00:00+00:00 | 202.986762 | 234.497751 | 0.155237 | -0.121576 |
| 4 | 2020-05-31 00:00:00+00:00 | 230.988287 | 245.649163 | 0.063470 | -0.065822 |
| 5 | 2020-06-30 00:00:00+00:00 | 245.156493 | 249.831985 | 0.019071 | -0.048006 |
| 6 | 2020-07-31 00:00:00+00:00 | 250.800814 | 256.312811 | 0.021978 | -0.027083 |
| 7 | 2020-08-31 00:00:00+00:00 | 257.592678 | 276.232473 | 0.072362 | 0.043318 |
| 8 | 2020-09-30 00:00:00+00:00 | 275.698043 | 270.294625 | -0.019599 | 0.022870 |
| 9 | 2020-10-31 00:00:00+00:00 | 272.077108 | 258.283073 | -0.050699 | -0.028988 |
| 10 | 2020-11-30 00:00:00+00:00 | 262.024893 | 289.680622 | 0.105546 | 0.073498 |
| 11 | 2020-12-31 00:00:00+00:00 | 292.717864 | 299.217200 | 0.022203 | 0.097334 |
We compute the Sharpe ratio for this benchmark strategy.
sharpe_ratio_benchmark = sharpe(df_benchmark["return_month"], annualize=12)
print(f"Sharpe ratio benchmark: {round(sharpe_ratio_benchmark, 2)}")
Sharpe ratio benchmark: 0.45
A Sharpe ratio of 0.45, much lower than that of the ML model!
Let's plot both strategies on the same graph to get a better idea of the difference.
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(
    go.Scatter(y=df_results["return_month_cumulative"], name="ML Model"),
)
fig.add_trace(
    go.Scatter(y=df_benchmark["return_month_cumulative"], name="Benchmark")
)
fig.update_layout(
title="Cumulative Return",
)
fig.show()
The ML model follows the same trajectory as the benchmark strategy, which makes sense given that the set of stocks is limited to only 30 tickers.
However, the model does seem to have been able to distinguish and pick high-performing stocks, leading to a roughly 3x higher return than the benchmark and a boosted Sharpe ratio. Nice job!
I hope this introductory post was able to give some insights on how to use Machine Learning for Trading.
Feel free to play around with the notebook, try different algorithms and features, or improve the backtesting approach with weights, stop-loss, take-profit, fees, etc.
Watch out though: it is very easy to get a "good-looking" backtest in financial Machine Learning, something that is often stressed by financial professionals. In another post, I will write about the dangers of repeated backtesting, show how easy it is to overfit a model, and explain how nested cross-validation can help to reduce this risk.
If you have any questions or remarks, drop a comment, and follow me for more posts like this one!
You can also follow me on the trading platform eToro, where I use Machine Learning at the core of my investment strategy.